Abstract

The mathematics of “Collective INtelligence” (COINs) is concerned
with the design of multi-agent systems so as to optimize an overall
global utility function when those systems lack centralized
communication and control (Jennings et al., 1998; Sen, 1997;
Sycara, 1998). Typically in COINs each agent runs a distinct
Reinforcement Learning (RL) algorithm (Kaelbling et al., 1996;
Sutton & Barto, 1998; Watkins & Dayan, 1992), so that much of the
design problem reduces to how best to initialize/update each agent's
private utility function, as far as the ensuing value of the global
utility is concerned. Traditional “team game” solutions to this
problem assign to each agent the global utility as its private
utility function (Crites & Barto, 1996). In previous work we used
the COIN framework to derive the alternative “Wonderful Life
Utility” (WLU), and experimentally established that having the
agents use it induces global utility performance up to orders of
magnitude superior to that induced by use of the team game utility.
The WLU has a free parameter (the “clamping parameter”) which we
simply set to 0 in that previous work. Here we derive the optimal
value of the clamping parameter, and demonstrate experimentally
that using that optimal value can result in significantly improved
performance over that of clamping to 0, over and above the
improvement beyond traditional approaches.

1. Introduction 

In this paper we are interested in multi-agent sys- 
tems (MAS's (Jennings et al., 1998; Sen, 1997; Sycara, 
1998)) having the following characteristics: 

• the agents each run reinforcement learning (RL) al- 
gorithms; 


• there is little to no centralized communication or 
control; 

• there is a provided world utility function that rates 
the possible histories of the full system. 

These kinds of problems may well be most readily addressed by
having each agent run a Reinforcement Learning (RL) algorithm. In
such a system, we are confronted with the inverse problem of how to
initialize/update the agents' individual utility functions to
ensure that the agents do not “work at cross-purposes”, so that
their collective behavior maximizes the provided global utility
function. Intuitively, we need to provide the agents with utility
functions they can learn well, while ensuring that their doing so
won't result in economics phenomena like the Tragedy of the Commons
(TOC; Hardin, 1968), liquidity trap, or Braess' paradox (Turner &
Wolpert, 2000).

This problem is related to work in many other fields, including
computational economics, mechanism design, reinforcement learning
for adaptive control, statistical mechanics, computational
ecologies, and game theory, in particular evolutionary game theory.
However, none of these fields directly addresses the inverse
problem. (This is even true for the field of mechanism design; see
(Wolpert & Turner, 2000a) for a detailed discussion of the
relationship between these fields, involving several hundred
references.)

Other previous work involves MAS's where agents use reinforcement
learning (Boutilier, 1999; Claus & Boutilier, 1998), and/or where
agents model the behavior of other agents (Hu & Wellman, 1998).
Typically this work simply elects to provide each agent with the
global utility function as its private utility function, in a
so-called “exact potential” or “team” game. Unfortunately, as
expounded below, this can result in very poor global performance in
large problems. Intuitively, the difficulty is that each agent can
have a hard time discerning the echo of its behavior on the global
utility when the system is large.




In previous work we used the COIN framework to de- 
rive the alternative “Wonderful Life Utility” (WLU) 
(Wolpert & Turner, 2000a), a utility that generically
avoids the pitfalls of the team game utility. In some 
of that work we used the WLU for distributed con- 
trol of network packet routing (Wolpert et al., 1999a). 
Conventional approaches to packet routing have each 
router run a shortest path algorithm (SPA), i.e., each 
router routes its packets in the way that it expects will 
get those packets to their destinations most quickly. 
Unlike with a COIN, with SPA-based routing the 
routers have no concern for the possible deleterious 
side-effects of their routing decisions on the global goal 
(e.g., they have no concern for whether they induce 
bottlenecks). We ran simulations that demonstrated 
that a COIN-based routing system has substantially 
better throughputs than does the best possible SPA- 
based system (Wolpert et al., 1999a), even though that 
SPA-based system has information denied the COIN 
system. In related work we have shown that use of the 
WLU automatically avoids the infamous Braess’ para- 
dox, in which adding new links can actually decrease 
throughput — a situation that readily ensnares SPA’s. 

Finally, in (Wolpert et al., 1999b) we considered the pared-down
problem domain of a congestion game, in particular a more
challenging variant of Arthur's El Farol bar attendance problem
(Arthur, 1994), sometimes also known as the “minority game”
(Challet & Zhang, 1998). In this problem, agents have to determine
which night in the week to attend a bar. The problem is set up so
that if either too few people attend (boring evening) or too many
people attend (crowded evening), the total enjoyment of the
attendees drops. Our goal is to design the reward functions of the
attendees so that the total enjoyment across all nights is
maximized. In this previous work of ours we showed that use of the
WLU can result in performance orders of magnitude superior to that
of team game utilities.

The WLU has a free parameter (the “clamping pa- 
rameter”), which we simply set to 0 in our previous 
work. To determine the optimal value of that pa- 
rameter we must employ some of the mathematics of 
COINs, whose relevant concepts we review in the next 
section. We next use those concepts to sketch the cal- 
culation deriving the optimal clamping parameter. To 
facilitate comparison with previous work, we chose to 
conduct our experimental investigations of the perfor- 
mance with this optimal clamping parameter in varia- 
tions of the Bar Problem. We present those variations 
in Section 3. Finally we present the results of the 
experiments in Section 4. Those results corroborate 
the predicted improvement in performance when using 
our theoretically derived clamping parameter. This
extends the superiority of the COIN-based approach
above conventional team-game approaches even further
than had been done previously.

2. Theory of COINs 

In this section we summarize that part of the theory of COINs
presented in (Wolpert et al., 1999a; Wolpert & Turner, 2000a;
Wolpert et al., 1999b) that is relevant to the study in this
article. We consider the state of the system across a set of
consecutive time steps, $t \in \{0, 1, \ldots\}$. Without loss of
generality, all relevant characteristics of agent $\eta$ at time
$t$ (including its internal parameters at that time as well as its
externally visible actions) are encapsulated by a Euclidean vector
$\zeta_{\eta,t}$, the state of agent $\eta$ at time $t$.
$\zeta_{,t}$ is the set of the states of all agents at $t$, and
$\zeta$ is the system's worldline, i.e., the state of all agents
across all time.

World utility is $G(\zeta)$, and when $\eta$ is an ML algorithm
“striving to increase” its private utility, we write that utility
as $\gamma_\eta(\zeta)$. (The mathematics can readily be
generalized beyond such ML-based agents; see (Wolpert & Turner,
2000b) for details.) Here we restrict attention to utilities of the
form $\sum_t R_t(\zeta_{,t})$ for reward functions $R_t$.

We are interested in systems whose dynamics is deterministic. (This
covers in particular any system run on a digital computer, even one
using a pseudo-random number generator to generate apparent
stochasticity.) We indicate that dynamics by writing
$\zeta = C(\zeta_{,0})$. So all characteristics of an agent $\eta$
at $t = 0$ that affect the ensuing dynamics of the system,
including its private utility, must be included in
$\zeta_{\eta,0}$.

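Definition: A system is factored if for each agent $\eta$
individually,

$$\gamma_\eta(C(\zeta_{,0})) \geq \gamma_\eta(C(\zeta'_{,0})) \;\Leftrightarrow\; G(C(\zeta_{,0})) \geq G(C(\zeta'_{,0})),$$

for all pairs $\zeta_{,0}$ and $\zeta'_{,0}$ that differ only for node $\eta$.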

For a factored system, the side effects of changes to $\eta$'s
$t = 0$ state that increase its private utility cannot decrease
world utility. If the separate agents have high values of their
private utilities, by luck or by design, then they have not
frustrated each other, as far as $G$ is concerned. (We arbitrarily
phrase this paper in terms of changes at time 0; the formalism is
easily extended to deal with arbitrary times.)

The definition of factored is carefully crafted. In particular, it
does not concern changes in the value of the utility of agents
other than the one whose state is varied. Nor does it concern
changes to the states of more than one agent at once. Indeed,
consider the following alternative desideratum to having the system
be factored: any change to $\zeta_{,0}$ that simultaneously
improves the ensuing values of all the agents' utilities must also
improve world utility. Although it seems quite reasonable, there
are systems that obey this desideratum and yet quickly evolve to a
minimum of world utility (Wolpert et al., 1999b).

For a factored system, when every agent's private utility is
optimized (given the other agents' behavior), world utility is at a
critical point (Wolpert & Turner, 2000a). In game-theoretic terms,
optimal global behavior occurs when the agents are at a private
utility Nash equilibrium (Fudenberg & Tirole, 1991). Accordingly,
there can be no TOC for a factored system.

As a trivial example, if $\gamma_\eta = G \;\forall \eta$, then the
system is factored, regardless of $C$. However there exist other,
often preferable sets of $\{\gamma_\eta\}$, as we now discuss.

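Definition: The ($t = 0$) effect set of node $\eta$ at $\zeta$,
$C^{\mathrm{eff}}_\eta(\zeta)$, is the set of all components
$\zeta_{\eta',t'}$ for which the gradients
$\nabla_{\zeta_{\eta,0}} (C(\zeta_{,0}))_{\eta',t'} \neq \vec{0}$.
$C^{\mathrm{eff}}_\eta$ with no specification of $\zeta$ is defined
as $\cup_{\zeta \in C}\, C^{\mathrm{eff}}_\eta(\zeta)$.

Intuitively, the effect set of $\eta$ is the set of all node-time
pairs affected by changes to $\eta$'s $t = 0$ state.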

Definition: Let $\sigma$ be a set of agent-time pairs.
$CL_\sigma(\zeta)$ is $\zeta$ modified by “clamping” the states
corresponding to all elements of $\sigma$ to some arbitrary
pre-fixed value, here taken to be 0. The wonderful life utility
(WLU) for $\sigma$ at $\zeta$ is defined as:

$$WLU_\sigma(\zeta) \equiv G(\zeta) - G(CL_\sigma(\zeta)). \tag{1}$$

In particular, the WLU for the effect set of node $\eta$ is
$G(\zeta) - G(CL_{C^{\mathrm{eff}}_\eta}(\zeta))$.
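As an illustration of Eq. (1), the following is a minimal sketch, assuming the worldline is stored as an array indexed by (agent, time) with scalar states. The function names `clamp` and `wonderful_life_utility` and the toy world utility `toy_G` are hypothetical and are not taken from the paper.

```python
import numpy as np

def clamp(worldline, sigma, value=0.0):
    """Return a copy of the worldline with every (agent, time) pair in sigma
    set to `value`. No dynamics are re-run: the mapping is purely fictional."""
    clamped = np.array(worldline, dtype=float, copy=True)
    for agent, time in sigma:
        clamped[agent, time] = value
    return clamped

def wonderful_life_utility(G, worldline, sigma, value=0.0):
    """WLU_sigma(zeta) = G(zeta) - G(CL_sigma(zeta)), as in Eq. (1)."""
    return G(worldline) - G(clamp(worldline, sigma, value))

def toy_G(worldline):
    """Hypothetical world utility, used only for illustration."""
    return float(np.sum(worldline) ** 2)

# Two agents, two time steps, scalar states (assumed layout).
zeta = np.array([[1.0, 2.0],
                 [3.0, 4.0]])
sigma = {(0, 0), (0, 1)}  # stand-in for agent 0's effect set
wlu = wonderful_life_utility(toy_G, zeta, sigma)
```

Only $G$, the full worldline, and the set $\sigma$ are needed here; the dynamics $C$ never appears in the computation.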

A node $\eta$'s effect set WLU is analogous to the change world
utility would undergo had node $\eta$ “never existed”. (Hence the
name of this utility; cf. the Frank Capra movie.) However $CL(.)$
is a purely “fictional”, counter-factual mapping, in that it
produces a new $\zeta$ without taking into account the system's
dynamics. The sequence of states produced by the clamping operation
in the definition of the WLU need not be consistent with the
dynamical laws embodied in $C$. This is a crucial strength of
effect set WLU. It means that to evaluate that WLU we do not try to
infer how the system would have evolved if node $\eta$'s state were
set to 0 at time 0 and the system re-evolved. So long as we know
$G$ and the full $\zeta$, and can accurately estimate what
agent-time pairs comprise $C^{\mathrm{eff}}_\eta$, we know the
value of $\eta$'s effect set WLU, even if we know nothing of the
details of the dynamics of the system.

Theorem 1: A COIN is factored if
$\gamma_\eta = WLU_{C^{\mathrm{eff}}_\eta} \;\forall \eta$
(proof in (Wolpert & Turner, 2000a)).

If our system is factored with respect to some $\{\gamma_\eta\}$,
then each $\zeta_{\eta,0}$ should be in a state with as high a
value of $\gamma_\eta(C(\zeta_{,0}))$ as possible. So for such
systems, our problem is determining what $\{\gamma_\eta\}$ the
agents will best be able to maximize while also causing dynamics
that is factored with respect to those $\{\gamma_\eta\}$.

Now regardless of $C(.)$, both $\gamma_\eta = G \;\forall \eta$ and
$\gamma_\eta = WLU_{C^{\mathrm{eff}}_\eta} \;\forall \eta$ are
factored systems. However since each agent is operating in a large
system, it may experience difficulty discerning the effects of its
actions on $G$ when $G$ sensitively depends on all components of
the system. Therefore each $\eta$ may have difficulty learning how
to achieve high $\gamma_\eta$ when $\gamma_\eta = G$. This problem
can be obviated by using effect set WLU, since the subtraction of
the clamped term removes some of the “noise” of the activity of
other agents, leaving only the underlying “signal” of how agent
$\eta$ affects its utility.

We can quantify this signal/noise effect by comparing the
ramifications on the private utilities arising from changes to
$\zeta_{\eta,0}$ with the ramifications arising from changes to
$\zeta_{\hat\eta,0}$, where $\hat\eta$ represents all nodes other
than $\eta$. We call this quantification the learnability of those
utilities at the point $\zeta = C(\zeta_{,0})$ (Wolpert & Turner,
2000a). A linear approximation to the learnability in the vicinity
of the worldline $\zeta$ is the differential learnability
$\lambda_{\eta,\gamma_\eta}(\zeta)$:

$$\lambda_{\eta,\gamma_\eta}(\zeta) \equiv \frac{\|\nabla_{\zeta_{\eta,0}} \gamma_\eta(C(\zeta_{,0}))\|}{\|\nabla_{\zeta_{\hat\eta,0}} \gamma_\eta(C(\zeta_{,0}))\|}. \tag{2}$$

Differential learnability captures the signal-to-noise 
advantage of the WLU in the following theorem: 

Theorem 2: Let $\sigma$ be a set containing $C^{\mathrm{eff}}_\eta$. Then

$$\frac{\lambda_{\eta,WLU_\sigma}(\zeta)}{\lambda_{\eta,G}(\zeta)} = \frac{\|\nabla_{\zeta_{\hat\eta,0}} G(C(\zeta_{,0}))\|}{\|\nabla_{\zeta_{\hat\eta,0}} G(C(\zeta_{,0})) - \nabla_{\zeta_{\hat\eta,0}} G(CL_\sigma(C(\zeta_{,0})))\|}$$

(proof in (Wolpert & Turner, 2000a)). This ratio of gradients
should be large whenever $\sigma$ is a small part of the system, so
that the clamping won't affect $G$'s dependence on
$\zeta_{\hat\eta,0}$ much, and therefore that dependence will
approximately cancel in the denominator term. In such cases, WLU is
factored, just as $G$ is, but far more learnable. The experiments
presented below illustrate the power of this fact in the context of
the bar problem, where one can readily approximate effect set WLU
and therefore use a utility for which the conditions in Theorems 1
and 2 should hold.
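The form of the ratio in Theorem 2 can be motivated by the following sketch. It is not the proof from the cited reference; it assumes that $G$ and $C$ are differentiable and that, because $\sigma$ contains the effect set of $\eta$, the clamped worldline $CL_\sigma(C(\zeta_{,0}))$ does not depend on $\zeta_{\eta,0}$.

```latex
\begin{align*}
\nabla_{\zeta_{\eta,0}} WLU_\sigma(C(\zeta_{,0}))
  &= \nabla_{\zeta_{\eta,0}} G(C(\zeta_{,0}))
   - \underbrace{\nabla_{\zeta_{\eta,0}} G(CL_\sigma(C(\zeta_{,0})))}_{=\,\vec{0}}
   = \nabla_{\zeta_{\eta,0}} G(C(\zeta_{,0})), \\[4pt]
\frac{\lambda_{\eta,WLU_\sigma}(\zeta)}{\lambda_{\eta,G}(\zeta)}
  &= \frac{\|\nabla_{\zeta_{\eta,0}} G\| \,/\, \|\nabla_{\zeta_{\hat\eta,0}} WLU_\sigma\|}
          {\|\nabla_{\zeta_{\eta,0}} G\| \,/\, \|\nabla_{\zeta_{\hat\eta,0}} G\|}
   = \frac{\|\nabla_{\zeta_{\hat\eta,0}} G(C(\zeta_{,0}))\|}
          {\|\nabla_{\zeta_{\hat\eta,0}} G(C(\zeta_{,0}))
            - \nabla_{\zeta_{\hat\eta,0}} G(CL_\sigma(C(\zeta_{,0})))\|}.
\end{align*}
```

The "signal" terms (gradients with respect to $\zeta_{\eta,0}$) are identical for $G$ and $WLU_\sigma$ and cancel, so the ratio compares only the "noise" terms.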

3. The Bar Problem 

Arthur's bar problem (Arthur, 1994) can be viewed as a problem in
designing COINs. Loosely speaking, in this problem at each time $t$
each agent $\eta$ decides whether to attend a bar by predicting,
based on its previous experience, whether the bar will be too
crowded to be “rewarding” at that time, as quantified by a reward
function $R_G$. The greedy nature of the agents frustrates the
global goal of maximizing $R_G$ at $t$. This is because if most
agents think the attendance will be low (and therefore choose to
attend), the attendance will actually be high, and vice-versa. We
modified Arthur's original problem to be more general, and since we
are not interested here in directly comparing our results to those
in (Arthur, 1994; Challet & Zhang, 1998), we use a more
conventional ML algorithm than the ones investigated in (Arthur,
1994; Caldarelli et al., 1997; Challet & Zhang, 1998).

There are N agents, each picking one of seven nights 
to attend a bar in a particular week, a process that is 
then repeated for the following weeks. In each week, 
each agent’s pick is determined by its predictions of 
the associated rewards it would receive if it made that 
pick. Each such prediction in turn is based solely upon 
the rewards received by the agent in those preceding 
weeks in which it made that pick. 

The world utility is $G(\zeta) = \sum_t R_G(\zeta_{,t})$, where
$R_G(\zeta_{,t}) = \sum_{k=1}^{7} \phi(x_k(\zeta_{,t}))$,
$x_k(\zeta_{,t})$ is the total attendance on night $k$ at week $t$,
$\phi(y) \equiv y \exp(-y/c)$, and $c$ is a real-valued parameter.
Our choice of $\phi(.)$ means that when too few agents attend some
night in some week, the bar suffers from lack of activity and
therefore the world reward is low. Conversely, when there are too
many agents the bar is overcrowded and the reward is again low.
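The shape of the world reward can be checked with a few lines of code. This is a minimal sketch of $\phi(y) = y \exp(-y/c)$ and of the weekly reward summed over nights; the function names and the example attendance profiles are illustrative assumptions.

```python
import math

def phi(y, c):
    """Per-night reward y * exp(-y / c); it is maximal at y = c."""
    return y * math.exp(-y / c)

def world_reward(attendance, c):
    """Weekly world reward: sum of phi over the nightly attendance totals."""
    return sum(phi(x, c) for x in attendance)

c = 3.0
empty_week = world_reward([0] * 7, c)           # nobody attends
ideal_week = world_reward([3] * 7, c)           # every night at the peak y = c
uniform_60 = world_reward([60 / 7] * 7, c)      # 60 agents spread evenly
one_night_60 = world_reward([60] + [0] * 6, c)  # 60 agents on a single night
```

With 60 agents and $c = 3$, spreading the agents evenly already overcrowds every night, which is why a uniform profile serves only as a baseline rather than as the optimum.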

Since we are concentrating on the choice of utilities rather than
the RL algorithms that use them, we use simple RL algorithms. Each
agent $\eta$ has a 7-dimensional vector representing its estimate
of the reward it would receive for attending each night of the
week. At the beginning of each week, to trade off exploration and
exploitation, $\eta$ picks the night to attend randomly using a
Boltzmann distribution over the seven components of $\eta$'s
estimated rewards vector. For simplicity, temperature did not decay
in time.


However, to reflect the fact that each agent perceives an
environment that is changing in time, the reward estimates were
formed using exponentially aged data: in any week $t$, the estimate
agent $\eta$ makes for the reward for attending night $i$ is a
weighted average of all the rewards it has previously received when
it attended that night, with the weights given by an exponential
function of how long ago each such reward was.
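The learner described above can be sketched as follows. The class name `BarAgent`, the default temperature, the decay rate, the weight form `decay ** age`, and the estimate of 0 for a night with no recorded rewards are all assumptions made for illustration; the paper does not give these values.

```python
import math
import random

class BarAgent:
    """Boltzmann action selection over exponentially aged reward estimates."""

    def __init__(self, n_actions=7, temperature=1.0, decay=0.9, rng=None):
        self.n_actions = n_actions
        self.temperature = temperature  # held fixed over time, as in the text
        self.decay = decay              # assumed aging rate per week
        self.rng = rng or random.Random()
        self.history = [[] for _ in range(n_actions)]  # (week, reward) pairs

    def record(self, week, action, reward):
        self.history[action].append((week, reward))

    def estimate(self, action, week):
        """Weighted average of past rewards, weight = decay ** (weeks ago)."""
        samples = self.history[action]
        if not samples:
            return 0.0  # assumption: no data yet for this action
        weights = [self.decay ** (week - w) for w, _ in samples]
        total = sum(wt * r for wt, (_, r) in zip(weights, samples))
        return total / sum(weights)

    def action_probabilities(self, week):
        est = [self.estimate(a, week) for a in range(self.n_actions)]
        top = max(est)  # subtract the maximum for numerical stability
        expo = [math.exp((e - top) / self.temperature) for e in est]
        norm = sum(expo)
        return [x / norm for x in expo]

    def pick(self, week):
        probs = self.action_probabilities(week)
        return self.rng.choices(range(self.n_actions), weights=probs, k=1)[0]

agent = BarAgent(rng=random.Random(0))
agent.record(week=0, action=2, reward=1.0)
agent.record(week=1, action=2, reward=3.0)
probs = agent.action_probabilities(week=2)
choice = agent.pick(week=2)
```

A higher temperature flattens `probs` toward uniform (more exploration); a lower one concentrates it on the night with the highest estimate.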

To form the agents' initial training set, we had an initial
training period in which all actions by all agents were chosen
uniformly randomly, and the associated rewards recorded by all
agents. After this period, the Boltzmann scheme outlined above was
“turned on”.

This simple RL algorithm works with rewards rather than full-blown
utilities. So formally speaking, to apply the COIN framework to it,
it is necessary to extend that framework to encompass rewards in
addition to utilities, and in particular to consider the effect set
wonderful life reward (WLR), whose value at moment $t$ for agent
$\eta$ is
$R_G(\zeta_{,t}) - R_G(CL_{C^{\mathrm{eff}}_\eta}(\zeta_{,t}))$. To
do this one uses Thm. 1 to prove that, under some mild assumptions,
if we have a set of private rewards that are factored with respect
to world rewards, then maximizing those private rewards also
maximizes the full world utility. In terms of game theory, a Nash
equilibrium of the single-stage game induces a maximum of the world
utility defined over the entire multi-stage game. (Intuitively,
this follows from the fact that the world utility is a sum of the
world rewards.) In addition, one can show that the WLR is factored
with respect to the world reward, and that it has the same
advantageous learnability characteristics that accrue to the WLU.
Accordingly, just as the COIN framework recommends we use WLU when
dealing with utility-based RL algorithms, it recommends that we use
WLR in the bar problem when dealing with reward-based RL
algorithms. See (Wolpert & Turner, 2000b).

Example: It is worth illustrating how the WLR is factored with
respect to the world reward in the context of the bar problem. Say
we're comparing the action of some particular agent going on night
1 versus that agent going on night 2, in some pre-fixed week. Let
$x'_1$ and $x'_2$ be the total attendances of everybody but our
agent, on nights 1 and 2 of that week, respectively. So $WLR(1)$,
the WLR value for the agent attending night 1, is given by

$$WLR(1) = \phi(x'_1 + 1) - \phi(x'_1 + CL_1) + \phi(x'_2) - \phi(x'_2 + CL_2) + \sum_{i>2} [\phi(x_i) - \phi(x_i + CL_i)],$$

where $CL_i$ is the $i$'th component of our clamped vector.
Similarly,

$$WLR(2) = \phi(x'_1) - \phi(x'_1 + CL_1) + \phi(x'_2 + 1) - \phi(x'_2 + CL_2) + \sum_{i>2} [\phi(x_i) - \phi(x_i + CL_i)].$$

Combining,
$\mathrm{sgn}(WLR(1) - WLR(2)) = \mathrm{sgn}(\phi(x'_1 + 1) - \phi(x'_1) - \phi(x'_2 + 1) + \phi(x'_2))$.
On the other hand, $R_G(1)$, the $R_G$ value for the agent
attending night 1, is
$\phi(x'_1 + 1) + \phi(x'_2) + \sum_{i>2} \phi(x_i)$. Similarly,
$R_G(2)$ is $\phi(x'_1) + \phi(x'_2 + 1) + \sum_{i>2} \phi(x_i)$.
Therefore
$\mathrm{sgn}(R_G(1) - R_G(2)) = \mathrm{sgn}(\phi(x'_1 + 1) + \phi(x'_2) - \phi(x'_2 + 1) - \phi(x'_1))$.

So $\mathrm{sgn}(WLR(1) - WLR(2)) = \mathrm{sgn}(R_G(1) - R_G(2))$.
This is true for any pair of nights, and any attendances $\{x_i\}$,
and any clamping vector. This establishes the claim that WLR is
factored with respect to the world reward, for the bar problem.
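The factoredness argument in this example can also be checked numerically. The sketch below draws random attendances of the other agents, random clamping vectors and random pairs of nights, and compares the signs of the two differences. The function names, the ranges of the random draws, and the tolerance used to treat tiny floating-point differences as zero are assumptions made for illustration.

```python
import math
import random

def phi(y, c=3.0):
    return y * math.exp(-y / c)

def world_reward(attendance, c=3.0):
    return sum(phi(x, c) for x in attendance)

def reward_pair(others, night, clamp, c=3.0):
    """Return (R_G, WLR) when the agent joins `night` on top of `others`."""
    actual = [x + (1 if d == night else 0) for d, x in enumerate(others)]
    clamped = [x + cl for x, cl in zip(others, clamp)]  # agent replaced by clamp
    r_g = world_reward(actual, c)
    return r_g, r_g - world_reward(clamped, c)

def sign(v, tol=1e-9):
    """Sign of v, treating |v| <= tol as zero to absorb rounding error."""
    return 0 if abs(v) <= tol else (1 if v > 0 else -1)

def factoredness_holds(trials=1000, seed=0, n_nights=7):
    rng = random.Random(seed)
    for _ in range(trials):
        others = [rng.randint(0, 20) for _ in range(n_nights)]
        clamp = [rng.uniform(0.0, 1.0) for _ in range(n_nights)]
        n1, n2 = rng.sample(range(n_nights), 2)
        g1, w1 = reward_pair(others, n1, clamp)
        g2, w2 = reward_pair(others, n2, clamp)
        if sign(w1 - w2) != sign(g1 - g2):
            return False
    return True

all_agree = factoredness_holds()
```

The agreement follows because the clamped term is the same for both candidate nights and cancels in the difference, exactly as in the derivation above.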

When using the WLR we are faced with the question of setting the
clamping parameter, i.e., of determining the best values to which
to clamp the components of $C^{\mathrm{eff}}_\eta$. One way to do
this is to solve for those values that maximize differential
learnability. An approximation to this calculation is to solve for
the clamping parameter that minimizes the expected value of
$[\lambda_{\eta,WLR_\eta}]^{-2}$, where the expectation is over the
values $\zeta_{,t}$ and associated rewards making up $\eta$'s
training set.

A number of approximations have to be made to carry out this
calculation. The final result is that $\eta$ should clamp to its
(appropriately data-aged) empirical expected average action, where
that average is over the elements in its training set (Wolpert &
Turner, 2000b). Here, for simplicity, we ignore the data-aging
stipulation of this result. Also for simplicity, we do not actually
make sure to clamp each $\eta$ separately to its own average
action, a process that involves $\eta$ modifying what it clamps to
in an online manner. Rather we choose to clamp all agents to the
same vector, where that vector is an initial guess as to what the
average action of a typical agent will be. Here, where the initial
training period has each agent choose its action uniformly
randomly, that guess is just the uniform average of all actions.
The experiments recounted in the next section illustrate that even
using these approximations, performance with the associated
clamping parameter is superior to that of using the WL reward with
clamping to 0, which in turn exhibits performance significantly
superior to use of team game rewards.

4. Experiments 

4.1 Single Night Attendance 

Our initial experiments compared three choices of the clamping
parameter: clamping to “zero”, i.e., the action vector given by
$\vec{0} = (0, 0, 0, 0, 0, 0, 0)$, as in our original work;
clamping to “ones”, i.e., the action vector
$\vec{1} = (1, 1, 1, 1, 1, 1, 1)$; and clamping to the (ideal)
“average” action vector for the agents after the initial training
period, denoted by $\vec{a}$. Intuitively, the first clamping is
equivalent to the agent “staying at home,” while the second option
corresponds to the agent attending every night. The third option is
equivalent to the agents attending partially on all nights in
proportions equivalent to the overall attendance profile of all
agents across the initial training period. (If taken, this “action”
would violate the dynamics of the system, but because it is a
fictional action as described in Section 2, it is consistent with
COIN theory.)


In order to distinguish among the different clamping 
operators, we will include the action vector to which 
the agents are clamped as a subscript (e.g., $CL_{\vec{0}}$ will
denote the operation where the action is clamped to 
the zero vector). Because all agents have the same 
reward function in the experiments reported here, we 
will drop the agent subscript from the reward function. 


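We compared performance with these three WLRs and the team game
reward, $R_G$. Writing them out, those three WLR reward functions
are:

$$
\begin{aligned}
R_{WL_{\vec{0}}}(\zeta_{,t}) &\equiv R_G(\zeta_{,t}) - R_G(CL_{\vec{0}}(\zeta_{,t})) \\
  &= \phi_{d_\eta}(x_{d_\eta}(\zeta_{,t})) - \phi_{d_\eta}(x_{d_\eta}(\zeta_{,t}) - 1) \\
R_{WL_{\vec{1}}}(\zeta_{,t}) &\equiv R_G(\zeta_{,t}) - R_G(CL_{\vec{1}}(\zeta_{,t})) \\
  &= \sum_{d \neq d_\eta}^{7} \left[ \phi_d(x_d(\zeta_{,t})) - \phi_d(x_d(\zeta_{,t}) + 1) \right] \\
R_{WL_{\vec{a}}}(\zeta_{,t}) &\equiv R_G(\zeta_{,t}) - R_G(CL_{\vec{a}}(\zeta_{,t})) \\
  &= \sum_{d \neq d_\eta}^{7} \left[ \phi_d(x_d(\zeta_{,t})) - \phi_d(x_d(\zeta_{,t}) + a_d) \right] \\
  &\quad + \phi_{d_\eta}(x_{d_\eta}(\zeta_{,t})) - \phi_{d_\eta}(x_{d_\eta}(\zeta_{,t}) - 1 + a_{d_\eta})
\end{aligned}
$$

where $d_\eta$ is the night picked by $\eta$, and $a_d$ is the
component of $\vec{a}$ corresponding to night $d$.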
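The three closed forms can be cross-checked against the direct definition $R_G(\zeta_{,t}) - R_G(CL(\zeta_{,t}))$. The sketch below assumes a single $\phi$ shared by all nights, the single-night setting (so clamping removes the agent's one attendance and adds the clamp vector), and a hypothetical week of 60 agents; the function names are illustrative.

```python
import math

def phi(y, c=3.0):
    return y * math.exp(-y / c)

def world_reward(attendance, c=3.0):
    return sum(phi(x, c) for x in attendance)

def wl_reward(attendance, night, clamp, c=3.0):
    """Direct definition: remove the agent from `night`, add `clamp` instead."""
    clamped = [x - (1 if d == night else 0) + clamp[d]
               for d, x in enumerate(attendance)]
    return world_reward(attendance, c) - world_reward(clamped, c)

def wl_zero(attendance, night, c=3.0):
    x = attendance[night]
    return phi(x, c) - phi(x - 1, c)

def wl_ones(attendance, night, c=3.0):
    return sum(phi(x, c) - phi(x + 1, c)
               for d, x in enumerate(attendance) if d != night)

def wl_average(attendance, night, a, c=3.0):
    others = sum(phi(x, c) - phi(x + a[d], c)
                 for d, x in enumerate(attendance) if d != night)
    x = attendance[night]
    return others + phi(x, c) - phi(x - 1 + a[night], c)

attendance = [10, 8, 9, 7, 12, 6, 8]  # hypothetical week, 60 agents in total
night = 2                             # night picked by the agent of interest
a = [1 / 7] * 7                       # uniform average action

direct_zero = wl_reward(attendance, night, [0] * 7)
direct_ones = wl_reward(attendance, night, [1] * 7)
direct_avg = wl_reward(attendance, night, a)
```

Note how `wl_zero` reads only `attendance[night]`, whereas `wl_ones` reads the six other nights and `wl_average` reads all seven.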

The team game reward, $R_G$, results in the system meeting the
desideratum of factoredness. However, because of Theorem 2, we
expect $R_G$ to have poor learnability, particularly in comparison
to that of $R_{WL_{\vec{a}}}$ (see (Wolpert & Turner, 2000a) for
details). Note that to evaluate $R_{WL_{\vec{0}}}$ each agent only
needs to know the total attendance on the night it attended. In
contrast, $R_G$ and $R_{WL_{\vec{a}}}$ require centralized
communication concerning all 7 nights, and $R_{WL_{\vec{1}}}$
requires communication concerning 6 nights.

Figure 1 graphs world reward against time, averaged over 100 runs,
for 60 agents and $c = 3$. (Throughout this paper, error bars are
too small to depict.) The two straight lines correspond to the
optimal performance, and the “baseline” performance given by
uniform occupancies across all nights. Systems using
$WL_{\vec{a}}$ and $WL_{\vec{0}}$ rapidly converged to optimal and
to quite good performance, respectively. This indicates that for
the bar problem the “mild assumptions” mentioned above hold, and
that the approximations in the derivation of the optimal clamping
parameter are valid.

Figure 1. Reward function comparison when agents attend one night.
($WL_{\vec{a}}$ is O; $WL_{\vec{0}}$ is +; $WL_{\vec{1}}$ is □; G is ×)

In agreement with our previous results, use of the reward $R_G$
converged very poorly in spite of its being factored. The same was
true for the $WL_{\vec{1}}$ reward. This behavior highlights the
subtle interaction between factoredness and learnability. Because
the signal-to-noise was lower for these reward functions, it was
very difficult for individual agents to determine what actions
would improve their private utilities, and they therefore had
difficulty finding good solutions.
Figure 2. Scaling properties of the different reward functions.
($WL_{\vec{a}}$ is O; $WL_{\vec{0}}$ is +; $WL_{\vec{1}}$ is □; G is ×)

Figure 2 shows how $t = 500$ performance scales with $N$ for each
of the reward signals. For comparison purposes the performance is
normalized: for each utility $U$ we plot
$\frac{R_U - R_{base}}{R_{opt} - R_{base}}$, where $R_{opt}$ and
$R_{base}$ are the optimal performance and a canonical baseline
performance given by uniform attendance across all nights,
respectively. Systems using $R_G$ perform adequately when $N$ is
low. As $N$ increases however, it becomes increasingly difficult
for the agents to extract the information they need from $R_G$.
Because of their superior learnability, systems using the WL
rewards overcome this signal-to-noise problem to a great extent.
Because the WL rewards are based on the difference between the
actual state and the state where one agent is clamped, they are
much less affected by the total number of agents. However, the
action vector to which agents are clamped also affects the scaling
properties.

4.2 Multiple Night Attendance 

In order to study the relationship between the clamp- 
ing parameter and the resulting world utility in more 
detail, we now modify the bar problem as follows: 
Each week, each agent picks three nights to attend
the bar. So each of the seven possible actions now cor- 
responds to a different attendance pattern. (Keeping 
the number of candidate actions at 7 ensures that the 
complexity of the RL problem faced by the agents is 
roughly the same.) Here those seven attendance pro- 
files were attending the first three nights, attending 
nights 2 through 4, ..., attending on nights 7, 1 and 2. 

Figure 3 shows world reward value as a function of time for this
problem, averaged over 100 runs, for all four reward functions. For
these simulations $c = 8$, and there were 60 agents. Optimal and
baseline performance are plotted as straight lines. Note that in
the experiments of the previous section $CL_{\vec{a}}$ clamps to
the attendance vector $v$ with components
$v_i = \sum_{d=1}^{7} \frac{\delta_{d,i}}{7}$, where $\delta_{d,i}$
is the Kronecker delta function. Now however it clamps to
$v_i = \sum_{d=1}^{7} \frac{u_{d,i}}{7}$, where $u_{d,i}$ is the
$i$'th component (0 or 1) of the $d$'th action vector, so that for
each $d$ it contains three 1's and four 0's.
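The clamping vector $v$ can be computed directly from the action vectors. The sketch below builds the seven cyclic attendance patterns described in this subsection (nights 1 to 3, nights 2 to 4, and so on, wrapping around to nights 7, 1 and 2) and averages them. The function names are illustrative, and using the same cyclic-window construction for other numbers of nights is an assumption.

```python
def action_vectors(nights_attended, n_nights=7):
    """Action d attends `nights_attended` consecutive nights starting at
    night d (0-based), wrapping around the end of the week."""
    return [[1 if (i - d) % n_nights < nights_attended else 0
             for i in range(n_nights)]
            for d in range(n_nights)]

def average_action(actions):
    """Component-wise mean of the action vectors: v_i = sum_d u_{d,i} / 7."""
    n = len(actions)
    return [sum(column) / n for column in zip(*actions)]

three_night_actions = action_vectors(3)
v_three = average_action(three_night_actions)  # every component equals 3/7
v_single = average_action(action_vectors(1))   # every component equals 1/7
```

With one night per action the action vectors reduce to the rows of the identity matrix, which recovers the Kronecker-delta expression of the previous section.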



Figure 3. Reward function comparison when agents attend three
nights. ($WL_{\vec{a}}$ is O; $WL_{\vec{0}}$ is +; $WL_{\vec{1}}$ is □; G is ×)

As in the previous case, the reward obtained by clamping to the
average action, $R_{WL_{\vec{a}}}$, performs near optimally.
$R_{WL_{\vec{0}}}$ on the other hand shows a slight drop-off
compared to the previous problem. $R_{WL_{\vec{1}}}$ now performs
almost as well as $R_{WL_{\vec{0}}}$. All three WL rewards still
significantly outperform the team game reward. What is noticeable
though is that as the number of nights to attend increases, the
difference between $R_{WL_{\vec{0}}}$ and $R_{WL_{\vec{1}}}$
decreases, illustrating how changing the problem can change the
relative performances of the various WL rewards.

4.3 Sensitivity to Clamping 

The results of the previous section show that the action
vector to which agents clamp has a considerable
impact on the global performance. In this section we 
study how that dependence varies with changes in the 
problem formulation. 

We considered four additional variants of the bar prob- 
lem just like the one described in the previous sub- 
section, only with four new values for the number of 
nights each agent attends. As in the previous sec- 
tion, we keep the number of actions at seven and map 
those actions to correspond to attending particular 
sets of nights. Also as in the previous section, we 
choose the attendance profiles of each potential action 
so that when the actions are selected uniformly the 
resultant attendance profile is also uniform. We also 
modify c to keep the “congestion” level of the prob- 
lem at a level similar to the original problem. (For 
the number of nights attended going from one to six, 
c = {3, 6, 8, 10, 12, 15} respectively.) 



Figure 4. Behavior of different reward functions with respect to
number of nights to attend. ($WL_{\vec{a}}$ is O; $WL_{\vec{0}}$ is +; $WL_{\vec{1}}$ is □; G is ×)

Figure 4 shows the normalized world reward obtained for the
different rewards as a function of the number of nights each agent
attends. $R_{WL_{\vec{a}}}$ performs well across the set of
problems. $R_{WL_{\vec{1}}}$ on the other hand performs poorly when
agents only attend on a few nights, but reaches the performance of
$R_{WL_{\vec{a}}}$ when agents need to select six nights, a
situation where the two clamped action vectors are very similar.
$R_{WL_{\vec{0}}}$ shows a slight drop in performance when the
number of nights to attend increases, while $R_G$ shows a much more
pronounced drop. These results reinforce the conclusion obtained in
the previous section that the clamped action vector that best
matches the aggregate empirical attendance profile results in best
performance.

4.4 Sensitivity to Parameter Selection 

The final aspect of these reward functions we study is the
sensitivity of the associated performance to the internal
parameters of the learning algorithms. Figure 5 illustrates
experiments in the original bar problem presented in Figures 1 and
2, only for a set of different temperatures in the Boltzmann
distribution. $R_{WL_{\vec{a}}}$ is fairly insensitive to the
temperature, until it is so high that agents' actions are chosen
almost randomly. $R_{WL_{\vec{0}}}$ depends more than
$R_{WL_{\vec{a}}}$ does on having sufficient exploration and
therefore has a narrower range of good temperatures. Both
$R_{WL_{\vec{1}}}$ and $R_G$ have more serious learnability
problems, and therefore have shallower and thinner performance
graphs.



Figure 5. Sensitivity of reward functions to internal parameters
(horizontal axis: log temperature). ($WL_{\vec{a}}$ is O; $WL_{\vec{0}}$ is +; $WL_{\vec{1}}$ is □; G is ×)


5. Conclusion 

In this article we consider how to configure large multi- 
agent systems where each agent uses reinforcement 
learning. To that end we summarize relevant aspects 
of COIN theory, focussing on how to initialize/update 
the agents’ private utility functions so that their col- 
lective behavior optimizes a global utility function. 

In traditional “team game” solutions to this problem, 
which assign to each agent the global utility as its pri- 
vate utility function, each agent has difficulty discern- 
ing the effects of its actions on its own utility function. 
We confirmed earlier results that if the agents use the
alternative “Wonderful Life Utility” with clamping to 0, the
system converges to world reward values significantly superior to
those of the associated team game systems. We then demonstrated
that this wonderful life utility also results in faster
convergence, better scaling, and less sensitivity to parameters of
the agents' learning algorithms. We also showed that optimally
choosing the action to which agents clamp (rather 
than arbitrarily choosing 0) provides significant fur- 
ther gains in performance, according to all of these 
performance measures. Future work involves investi- 
gating various ways of having the agents determine 
their optimal clamping vectors dynamically. 

References 

Arthur, W. B. (1994). Complexity in economic theory:
Inductive reasoning and bounded rationality. The
American Economic Review, 84, 406-411.

Baum, E. (1998). Manifesto for an evolutionary economics
of intelligence. In C. M. Bishop (Ed.), Neural
networks and machine learning. Springer-Verlag.

Boutilier, C. (1999). Multiagent systems: Challenges
and opportunities for decision theoretic planning. AI
Magazine, 20, 35-43.

Boutilier, C., Shoham, Y., & Wellman, M. P. (1997).
Editorial: Economic principles of multi-agent systems.
Artificial Intelligence Journal, 94, 1-6.

Bradshaw, J. M. (Ed.). (1997). Software agents. MIT
Press.

Caldarelli, G., Marsili, M., & Zhang, Y. C. (1997).
A prototype model of stock exchange. Europhys.
Letters, 40, 479-484.

Challet, D., & Zhang, Y. C. (1998). On the minority
game: Analytical and numerical studies. Physica A,
256, 514.

Claus, C., & Boutilier, C. (1998). The dynamics of
reinforcement learning in cooperative multiagent systems.
Proceedings of the Fifteenth National Conference
on Artificial Intelligence (pp. 746-752).

Crites, R. H., & Barto, A. G. (1996). Improving elevator
performance using reinforcement learning. Advances
in Neural Information Processing Systems -
8 (pp. 1017-1023). MIT Press.

Fudenberg, D., & Tirole, J. (1991). Game theory.
Cambridge, MA: MIT Press.

Hardin, G. (1968). The tragedy of the commons. Science,
162, 1243-1248.

Hu, J., & Wellman, M. P. (1998). Multiagent reinforcement
learning: Theoretical framework and an
algorithm. Proceedings of the Fifteenth International
Conference on Machine Learning (pp. 242-250).

Huberman, B. A., & Hogg, T. (1988). The behavior
of computational ecologies. In The ecology of
computation, 77-115. North-Holland.

Jennings, N. R., Sycara, K., & Wooldridge, M. (1998).
A roadmap of agent research and development.
Autonomous Agents and Multi-Agent Systems, 1, 7-38.

Kaelbling, L. P., Littman, M. L., & Moore, A. W.
(1996). Reinforcement learning: A survey. Journal
of Artificial Intelligence Research, 4, 237-285.

Sandholm, T., Larson, K., Anderson, M., Shehory, O.,
& Tohme, F. (1998). Anytime coalition structure
generation with worst case guarantees. Proceedings
of the Fifteenth National Conference on Artificial
Intelligence (pp. 46-53).

Sen, S. (1997). Multi-agent learning: Papers from the
1997 AAAI workshop (Technical Report WS-97-03).
Menlo Park, CA: AAAI Press.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement
learning: An introduction. Cambridge, MA: MIT
Press.

Sycara, K. (1998). Multiagent systems. AI Magazine,
19, 79-92.

Turner, K., & Wolpert, D. H. (2000). Collective intelligence
and Braess' paradox. Pre-print.

Watkins, C., & Dayan, P. (1992). Q-learning. Machine
Learning, 8, 279-292.

Wolpert, D. H., & Turner, K. (2000a). An introduction
to collective intelligence. In J. M. Bradshaw
(Ed.), Handbook of agent technology. AAAI
Press/MIT Press. Available as tech. rep.
NASA-ARC-IC-99-63 from
http://ic.arc.nasa.gov/ic/projects/coin-pubs.html.

Wolpert, D. H., & Turner, K. (2000b). The mathematics
of collective intelligence. Pre-print.

Wolpert, D. H., Turner, K., & Frank, J. (1999a). Using
collective intelligence to route internet traffic.
Advances in Neural Information Processing Systems -
11 (pp. 952-958). MIT Press.

Wolpert, D. H., Wheeler, K., & Turner, K. (1999b).
General principles of learning-based multi-agent systems.
Proceedings of the Third International Conference
of Autonomous Agents (pp. 77-83).



